1 Climate change and temperature anomalies

If we wanted to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here

To define temperature anomalies you need to have a reference, or base, period which NASA clearly states that it is the period between 1951-1980.

Run the code below to load the file:

Notice that, when using this function, we added two options: skip and na.

  1. The skip=1 option is there as the real data table only starts in Row 2, so we need to skip one row.
  2. na = "***" option informs R how missing observations in the spreadsheet are coded. When looking at the spreadsheet, you can see that missing data is coded as "***". It is best to specify this here, as otherwise some of the data is not recognized as numeric data.

Once the data is loaded, notice that there is a object titled weather in the Environment panel. If you cannot see the panel (usually on the top-right), go to Tools > Global Options > Pane Layout and tick the checkbox next to Environment. Click on the weather object, and the dataframe will pop up on a seperate tab. Inspect the dataframe.

For each month and year, the dataframe shows the deviation of temperature from the normal (expected). Further the dataframe is in wide format.

You have two objectives in this section:

  1. Select the year and the twelve month variables from the weather dataset. We do not need the others (J-D, D-N, DJF, etc.) for this assignment. Hint: use select() function.

  2. Convert the dataframe from wide to ‘long’ format. Hint: use gather() or pivot_longer() function. Name the new dataframe as tidyweather, name the variable containing the name of the month as month, and the temperature deviation values as delta.

Inspect your dataframe. It should have three variables now, one each for

  1. year,
  2. month, and
  3. delta, or temperature deviation.

1.1 Plotting Information

Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plot chronologically.

In the following chunk of code, I used the eval=FALSE argument, which does not run a chunk of code; I did so that you can knit the document before tidying the data and creating a new dataframe tidyweather. When you actually want to run this code and knit your document, you must delete eval=FALSE, not just here but in all chunks were eval=FALSE appears.

Is the effect of increasing temperature more pronounced in some months? Use facet_wrap() to produce a seperate scatter plot for each month, again with a smoothing line. Your chart should human-readable labels; that is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.

The above-depicted scatter plots clearly depict that the effect of increasing temperature is more pronounced in Q1 and Q4, particularly November, December, February and March. This can potentially be linked to climate change associated extreme weather conditions leading to heatwaves and relatively higher tempurates around the world.

It is sometimes useful to group data into different time periods to study historical data. For example, we often refer to decades such as 1970s, 1980s, 1990s etc. to refer to a period of time. NASA calcuialtes a temperature anomaly, as difference form the base periof of 1951-1980. The code below creates a new data frame called comparison that groups data in five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data before 1800 and before using filter. Then, we use the mutate function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().

Inspect the comparison dataframe by clicking on it in the Environment pane.

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.

1.2 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.

## # A tibble: 1 × 7
##   mean_variation sd_variation count_variation se_variation t_critical
##            <dbl>        <dbl>           <int>        <dbl>      <dbl>
## 1           1.06        0.274             132       0.0239       1.98
## # … with 2 more variables: lower_95p_variation <dbl>, upper_95p_variation <dbl>
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.01     1.11
## # A tibble: 1 × 2
##   lower_95p_variation upper_95p_variation
##                 <dbl>               <dbl>
## 1                1.01                1.11

What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!

The results is the same up to at least two decimal points. Using the formula (first part), we came to a lower 95% confidence interval of 1.01 and an upper boundary of 1.11, which means that we are 95% confident that additional results would be inside that range (i.e. temperature deviation would be in this range). Using the bootstrapping method, we simulate 1000 datasets from our sample to get a new distribution that helps us to set these upper and lower boundaries.

2 Global warming and political views (GSS)

A 2010 Pew Research poll asked 1,306 Americans, “From what you’ve read and heard, is there solid evidence that the average temperature on earth has been getting warmer over the past few decades, or not?”

In this exercise we analyze whether there are any differences between the proportion of people who believe the earth is getting warmer and their political ideology. As usual, from the survey sample data, we will use the proportions to estimate values of population parameters. The file has 2253 observations on the following 2 variables:

  • party_or_ideology: a factor (categorical) variable with levels Conservative Republican, Liberal Democrat, Mod/Cons Democrat, Mod/Lib Republican
  • response : whether the respondent believes the earth is warming or not, or Don’t know/ refuse to answer

You will also notice that many responses should not be taken into consideration, like “No Answer”, “Don’t Know”, “Not applicable”, “Refused to Answer”.

## # A tibble: 3 × 5
## # Rowwise: 
##   party_or_ideology                warming total lower_ci upper_ci
##   <chr>                              <int> <int>    <dbl>    <dbl>
## 1 Conservative Republican              248   698    0.320    0.392
## 2 Liberal Democrat                     405   428    0.919    0.965
## 3 Moderate_Domocrat_and_republican     698   991    0.675    0.732

We will be constructing three 95% confidence intervals to estimate population parameters, for the % who believe that Earth is warming, accoridng to their party or ideology. You can create the CIs using the formulas by hand, or use prop.test()– just rememebr to exclude the Dont know / refuse to answer!

Does it appear that whether or not a respondent believes the earth is warming is independent of their party ideology? You may want to

The aforementioned table depicts that climate beleifs, in particular whether or not a respondent believes the earth is warming, is indeed influenced by the respondants party or ideology. The 95% confidence intervals for average percentage of those believe in earth-warming seem to differ a lot among parties. The data suggests that, relatively, Liberal Democrats are more likely to believe that earth warming is true and conservative republicans are least likely to believe in earthc warming.

You may want to read on The challenging politics of climate change

3 Biden’s Approval Margins

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval

## Rows: 1,819
## Columns: 22
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "10/1/2021", "10/1/2021", "10/1/2021", "10/1/2021"…
## $ startdate           <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate             <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/21/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade               <chr> "B", "B", "B+", "B", "B-", "B", "B+", "B-", "B", "…
## $ samplesize          <dbl> 1500, 15000, 1516, 1993, 1115, 15000, 941, 1200, 1…
## $ population          <chr> "lv", "a", "a", "rv", "a", "a", "rv", "rv", "lv", …
## $ weight              <dbl> 0.3382, 0.2594, 1.2454, 0.0930, 1.1014, 0.2333, 1.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48, 50, 45, 56, 55, 51, 63, 58, 48, 52, 53, 55, 48…
## $ disapprove          <dbl> 45, 28, 28, 31, 32, 28, 37, 32, 47, 29, 29, 33, 47…
## $ adjusted_approve    <dbl> 50.5, 48.6, 46.5, 54.6, 53.9, 49.6, 58.7, 56.9, 50…
## $ adjusted_disapprove <dbl> 38.8, 31.3, 28.4, 34.3, 32.9, 31.3, 38.0, 33.1, 40…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, TRUE, NA, NA, NA, TRUE, NA, NA, TRUE, TRUE, …
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74247, 74272, 74327, 74246, 74248, 74273, 74256, 7…
## $ question_id         <dbl> 139395, 139491, 139570, 139394, 139404, 139492, 13…
## $ createddate         <chr> "1/22/2021", "1/28/2021", "2/2/2021", "1/22/2021",…
## $ timestamp           <chr> "10:23:09  1 Oct 2021", "10:23:09  1 Oct 2021", "1…
## # A tibble: 1,819 × 22
##    president subgroup modeldate  startdate  enddate    pollster grade samplesize
##    <chr>     <chr>    <date>     <date>     <date>     <chr>    <chr>      <dbl>
##  1 Joseph R… All pol… 2021-10-01 2021-01-19 2021-01-21 Rasmuss… B           1500
##  2 Joseph R… All pol… 2021-10-01 2021-01-19 2021-01-21 Morning… B          15000
##  3 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 YouGov   B+          1516
##  4 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 Morning… B           1993
##  5 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-21 Ipsos    B-          1115
##  6 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-22 Morning… B          15000
##  7 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-22 HarrisX  B+           941
##  8 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-23 RMG Res… B-          1200
##  9 Joseph R… All pol… 2021-10-01 2021-01-20 2021-01-24 Rasmuss… B           1500
## 10 Joseph R… All pol… 2021-10-01 2021-01-21 2021-01-23 Morning… B          15000
## # … with 1,809 more rows, and 14 more variables: population <chr>,
## #   weight <dbl>, influence <dbl>, approve <dbl>, disapprove <dbl>,
## #   adjusted_approve <dbl>, adjusted_disapprove <dbl>, multiversions <chr>,
## #   tracking <lgl>, url <chr>, poll_id <dbl>, question_id <dbl>,
## #   createddate <date>, timestamp <chr>

3.1 Create a plot

What I would like you to do is to calculate the average net approval rate (approve- disapprove) for each week since he got into office. I want you plot the net approval, along with its 95% confidence interval. There are various dates given for each poll, please use enddate, i.e., the date the poll ended.

Also, please add an orange line at zero. Your plot should look like this:

## # A tibble: 31 × 7
##     week mean_appr_disappr sd_appr se_appr t_critical lower_95p_appr
##    <dbl>             <dbl>   <dbl>   <dbl>      <dbl>          <dbl>
##  1     9              12.9    7.24   1.07        2.01          10.7 
##  2    10              14.3    7.38   1.05        2.01          12.2 
##  3    11              16.1    8.67   1.28        2.01          13.5 
##  4    12              14.0    8.52   1.14        2.00          11.7 
##  5    13              13.0    8.57   1.24        2.01          10.5 
##  6    14              12.6    8.39   1.34        2.02           9.90
##  7    15              12.1    6.86   0.934       2.01          10.2 
##  8    16              13.3    6.71   0.819       2.00          11.7 
##  9    17              14.2    8.47   1.14        2.00          11.9 
## 10    18              14.0    7.57   1.03        2.01          11.9 
## # … with 21 more rows, and 1 more variable: upper_95p_appr <dbl>

We excluded everything before week 8 because we only have relatively lower observations in week 7, which leads to a relatively large confidence interval.

3.2 Compare Confidence Intervals

Compare the confidence intervals for week 3 and week 25. Can you explain what’s going on? One paragraph would be enough.

confidence interval for week 8 and week 3 (not included in the graph) is much wider than that of week 25 as there is not enough sample data from week 8 and therefore the variance of margins across different samples in week 25 is small. (please also note - Referring to the question asked on Slack, as we have different data sets from the original graph, the interpretaion might vary as well)

4 Challenge 1: Excess rentals in TfL bike sharing

Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following

## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-09-23T12%3A52%3A20/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20211003%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20211003T214741Z&X-Amz-Expires=300&X-Amz-Signature=8b7e26d73d2c2d9738f9cab44e2a4a6914ad8adf85b0dfb36762db9b22911cb3&X-Amz-SignedHeaders=host]
##   Date: 2021-10-03 21:52
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 174 kB
## <ON DISK>  /var/folders/lp/pk7fq3n95sb6vdn3c_zq522w0000gn/T//RtmpBPuaeT/file12bf663aa0a37.xlsx

We can easily create a facet grid that plots bikes hired by month and year.

Look at May and Jun and compare 2020 with the previous years. What’s happening?

May and June in 2019 had much more days with bike rentals across the mean compared to 2020 with a distribution that looks flatter. Overall, it seems that May and June 2020 had less bike rentals in total, potentially attributable to bad weather. Similarly it is possible that Covid may have caused people to go outside less, resulting in a lot of days with low number of bike rentals.

However, the challenge I want you to work on is to reproduce the following two graphs.

The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

# average_weekly_bikes <- median(bike$bikes_hired, na.rm = TRUE) * 7
average_weekly_bikes <- bike %>% 
  filter(year %in% c("2016", "2017", "2018", "2019")) %>% 
  summarise(avg = median(bikes_hired, na.rm = TRUE) * 7)

bike_weekly <- bike %>% 
  group_by(year, week) %>% 
  summarise(diff = (sum(bikes_hired) - average_weekly_bikes$avg)/average_weekly_bikes$avg) %>% 
  filter(year %in% c("2016", "2017", "2018", "2019", "2020", "2021"), week <= 52)

tfl_colors <- c("grey" = "grey",
                "below" = "red",
                "above" = "green",
                "positive" = "green",
                "negative" = "red")

ggplot(bike_weekly, aes(x=week, y=diff)) +
  geom_rect(aes(xmin = 14, xmax = 26, ymin = -Inf, ymax = Inf), fill = "grey", alpha = 0.01) +
  geom_rect(aes(xmin = 40, xmax = 52, ymin = -Inf, ymax = Inf), fill = "grey", alpha = 0.01) +
  geom_line() +
  geom_ribbon(aes(ymin=0, ymax=pmin(diff,0), fill="below", alpha = 0.3)) +
  geom_ribbon(aes(ymin=0, ymax=pmax(0, diff), fill="above", alpha = 0.3)) +
  facet_wrap(~year) +
  #scale_fill_manual(values = c('green', 'red')) +
  theme_bw() +
  labs(x="Week",
       y="Deviation",
       title="Weekly changes in TfL bike rentals",
       subtitle = "% change from weekly averages calculated between 2010-2021"
  ) +
  theme(legend.position = "none") +
  geom_rug(data= subset(bike_weekly, diff >= 0), aes(color="positive"), sides="b") +
  geom_rug(data= subset(bike_weekly, diff <= 0), aes(color="negative"), sides="b") +
  scale_fill_manual(values = tfl_colors) +
  scale_color_manual(values = tfl_colors) +
  ylim(-0.5, 1) +
  scale_y_continuous(labels = scales::percent)

For both of these graphs, you have to calculate the expected number of rentals per week or month between 2016-2019 and then, see how each week/month of 2020-2021 compares to the expected rentals. Think of the calculation excess_rentals = actual_rentals - expected_rentals.

Should you use the mean or the median to calculate your expected rentals? Why?

The median makes more sense, as extreme results won’t affect it as the mean would do. For example in March 2018 we have an extreme result that would falsify the data.

In creating your plots, you may find these links useful:

5 Challenge 2: How has the CPI and its components changed over the last few years?

Remember how we used the tidyqant package to download CPI data. In this exercise, I would like you to do the following:

  1. You can find CPI components at FRED. You should adapt the code from German polls to scrape the FRED website and pull all of the CPI components into a vector. FIY, the list of components is the second table in that webpage.
  2. Once you have a vector of components, you can then pass it to tidyquant::tq_get(get = "economic.data", from = "2000-01-01") to get all data since January 1, 2000
  3. Since the data you download is an index with various starting dates, you need to calculate the yearly, or 12-month change. To do this you need to use the lag function, and specifically, year_change = value/lag(value, 12) - 1; this means you are comparing the current month’s value with that 12 months ago lag(value, 12).
  4. I want you to order components so the higher the yearly change, the earlier does that component appear.
  5. You should also make sure that the All Items CPI (CPIAUCSL) appears first.
  6. Add a geom_smooth() for each component to get a sense of the overall trend. 1 You may want to colour the points according to whether yearly change was positive or negative.

Having done this, you should get this graph.

url <- "https://fredaccount.stlouisfed.org/public/datalist/843"

tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")

cpi <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())

cpi_components <- cpi[[2]]

econ_data <- cpi_components %>%
  select(series_id) %>% 
  pull() %>% 
  tq_get(get = "economic.data", from = "2000-01-01")

econ_data_detailed <- econ_data %>% 
  mutate(year_change = price / lag(price, 12) - 1)

avg_year_chg <- econ_data_detailed %>% 
  group_by(symbol) %>% 
  summarize(avg_year_change = mean(year_change, na.rm = TRUE))


econ_data_detailed <- left_join(econ_data_detailed, avg_year_chg, by = "symbol") %>% 
  arrange(desc(avg_year_change))

#econ_data_detailed[1] <- econ_data_detailed[econ_data_detailed$symbol == "CPIAUCSL"]


econ_data_detailed <- left_join(econ_data_detailed, cpi_components, by = c("symbol" = "series_id"))

econ_data_detailed$title <- str_replace(econ_data_detailed$title,
                                        "Consumer Price Index for All Urban Consumers: ",
                                        "")

econ_data_detailed$title <- str_replace(econ_data_detailed$title,
                                        " in U.S. City Average",
                                        "")

new_order <- c(unique(econ_data_detailed$title))

new_order2 <- c(new_order[30], new_order[-30])

econ_data_detailed <- arrange(transform(econ_data_detailed,
                                        title = factor(title, levels = new_order2)), title)


ggplot(econ_data_detailed, aes(x=date, y = year_change)) +
  geom_point(aes(color = factor(year_change <= 0), alpha = 0.1)) +
  #geom_point(aes(color="negative", alpha = 0.1)) +
  geom_smooth(type="loess", color="lightblue") +
  facet_wrap(~title, scales="free") +
  labs(x = "Year",
       y = "YoY % Change",
       title = "Yearly change of US CPI (All Items) and its components",
       subtitle = "YoY change being positive or negative") +
  scale_color_manual(values = c("red", "lightblue")) +
  theme(legend.position = "none") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  theme(legend.position = "none")

This graphs is fine, but perhaps has too many sub-categories. You can find the relative importance of components in the Consumer Price Indexes: U.S. city average, December 2020 here. Can you choose a smaller subset of the components you have and only list the major categories (Housing, Transportation, Food and beverages, Medical care, Education and communication, Recreation, and Apparel), sorted according to their relative importance?

We visited the website for the source data and noticed that our four chosen buckets are assinged the greatest CPI weight and therefore included them.

6 Deliverables

As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

7 Details

  • Who did you collaborate with: Mina Jia, Clara Wang, Julien Marécaux, Mariana Pino, Mina Jia, Moritz Gort, Olsi Arapi
  • Approximately how much time did you spend on this problem set: 10h
  • What, if anything, gave you the most trouble: Challenge 2

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.